Housing dataset from https://www.kaggle.com/datasets/vikramamin/housing-linear-regression

1 Data import

housing <- read.csv("housing.csv", dec = ".", stringsAsFactors = FALSE)
library(skimr)
skim(housing)
Data summary
Name housing
Number of rows 5000
Number of columns 7
_______________________
Column type frequency:
character 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Address 0 1 21 68 0 5000 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Avg..Area.Income 0 1 68583.11 10657.99 17796.63 61480.56 68804.29 75783.34 107701.75 ▁▁▇▆▁
Avg..Area.House.Age 0 1 5.98 0.99 2.64 5.32 5.97 6.65 9.52 ▁▃▇▃▁
Avg..Area.Number.of.Rooms 0 1 6.99 1.01 3.24 6.30 7.00 7.67 10.76 ▁▃▇▃▁
Avg..Area.Number.of.Bedrooms 0 1 3.98 1.23 2.00 3.14 4.05 4.49 6.50 ▅▇▇▃▃
Area.Population 0 1 36163.52 9925.65 172.61 29403.93 36199.41 42861.29 69621.71 ▁▃▇▅▁
Price 0 1 1232072.65 353117.63 15938.66 997577.14 1232669.38 1471210.20 2469065.59 ▁▃▇▃▁
library(summarytools)
dfSummary(housing,plain.ascii = FALSE, style = "grid",tmp.img.dir  = "/tmp",graph.magnif = 0.85)

1.0.1 Data Frame Summary

1.0.1.1 housing

Dimensions: 5000 x 7
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 Avg..Area.Income
[numeric]
Mean (sd) : 68583.1 (10658)
min < med < max:
17796.6 < 68804.3 < 107701.7
IQR (CV) : 14302.8 (0.2)
5000 distinct values 5000
(100.0%)
0
(0.0%)
2 Avg..Area.House.Age
[numeric]
Mean (sd) : 6 (1)
min < med < max:
2.6 < 6 < 9.5
IQR (CV) : 1.3 (0.2)
5000 distinct values 5000
(100.0%)
0
(0.0%)
3 Avg..Area.Number.of.Rooms
[numeric]
Mean (sd) : 7 (1)
min < med < max:
3.2 < 7 < 10.8
IQR (CV) : 1.4 (0.1)
5000 distinct values 5000
(100.0%)
0
(0.0%)
4 Avg..Area.Number.of.Bedrooms
[numeric]
Mean (sd) : 4 (1.2)
min < med < max:
2 < 4 < 6.5
IQR (CV) : 1.4 (0.3)
255 distinct values 5000
(100.0%)
0
(0.0%)
5 Area.Population
[numeric]
Mean (sd) : 36163.5 (9925.7)
min < med < max:
172.6 < 36199.4 < 69621.7
IQR (CV) : 13457.4 (0.3)
5000 distinct values 5000
(100.0%)
0
(0.0%)
6 Price
[numeric]
Mean (sd) : 1232073 (353117.6)
min < med < max:
15938.7 < 1232669 < 2469066
IQR (CV) : 473633.1 (0.3)
5000 distinct values 5000
(100.0%)
0
(0.0%)
7 Address
[character]
1. 000 Adkins Crescent South
2. 000 Todd Pines Ashleyberg
3. 001 Steve Plaza Jessicast
4. 0010 Gregory Loaf South E
5. 00149 Raymond Knolls New
6. 002 Katherine Flat Hartma
7. 0022 Young Rest Lake Kevi
8. 0029 Melinda Neck Apt. 59
9. 003 Erica Passage Apt. 27
10. 003 Fernando Gateway Suit
[ 4990 others ]
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
1 ( 0.0%)
4990 (99.8%)
5000
(100.0%)
0
(0.0%)

1.1 Missing values

sum(is.na(housing))
## [1] 0
sum(!complete.cases(housing))
## [1] 0
apply(is.na(housing), 2, sum)
##             Avg..Area.Income          Avg..Area.House.Age 
##                            0                            0 
##    Avg..Area.Number.of.Rooms Avg..Area.Number.of.Bedrooms 
##                            0                            0 
##              Area.Population                        Price 
##                            0                            0 
##                      Address 
##                            0

1.2 Conclusion

No empty rows found. The address can be discarded as it does not provide additional information. No factors need to be formed as all values are numeric.

The address is removed from the set because all values are distinct strings.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
housing <-housing %>%
    select(c(
    "Avg..Area.Income",
    "Avg..Area.House.Age",
    "Avg..Area.Number.of.Rooms",
    "Avg..Area.Number.of.Bedrooms",
    "Area.Population",
    "Price"
  ))

2 Descriptive data analysis

The descriptive data analysis consists of presenting:

  • Diagrams of the distribution of characteristics to show how the values of certain (non)numeric characteristic is spread or distributed across the dataset. The diagrams can include histograms, density plots, and box plots which help understanding the tendency, dispersion, and shape of the data distribution.
    • Tendency: A central or typical value of a dataset (e.g avg, mean, median, mode).
    • Dispersion(variability): Shows how spread out the values in a dataset are compared to the central point. This includes the range, interquartile range (IQR, range of the middle 50,75%), variance, and standard deviation.
      • Range: The difference between the highest and lowest points. It gives a quick sense about the spread, but is heavily influence by the outliers.
      • Variance: Average of the squared difference from the mean. Provides good insights into the variability within a dataset, by considering how each datapoint differs form the mean.
      • Standard deviation: Sqrt of variance, indicating that the datapoints are about x (std) units from the mean. It provides insights into the dispersion of the data.
  • Frequency tables to identify the most (un)common values and to understand the distribution of categorical data.

2.1 Price

The target property Y=Price is a continuous variable.

Let’s visualize the distribution using a histogram.

Price_hist <- hist(
  housing$Price, 
  col = "skyblue",
  main = "Price frequency",
  xlab="Price",
  ylab = "Frequency"
)

The relative distribution:

par(mfrow=c(1,2))

Price_hist$counts <- Price_hist$counts / sum(Price_hist$counts) * 100

plot(
  Price_hist,
  col = "skyblue",
  main = "Price % distribution",
  xlab="Price",
  ylab = "Frequency %",
  cex.main=0.8
)

Price_box <- boxplot(
  housing$Price, 
  col = "skyblue",
  horizontal = 1,
  main = "Price box plot",
  xlab = "Price",
  cex.main = 0.8,
  range = 2
)

Price has outliers on both sides are:

sort(Price_box$out)
## [1]   15938.66   31140.52 2469065.59

The 2 outliers on the left side have the values of 16e3 and 31e3. The 1 outlier on the right has a value of 2.5e6; Details of the 3 outliers:

Price_out <- which(housing$Price %in% Price_box$out)

library(DT)
datatable(
  housing[Price_out, ],
  options=list(scrollX=1,pageLength=3,searching = FALSE,scroller = TRUE,scrollY=200)
)

Frequency table of Price (first the data has to be cut() into intervals based on the histogram).

intervals <- cut(housing$Price, Price_hist$breaks, include.lowest = TRUE)
col_frequency <- table(intervals)
col_frequency_rel <- prop.table(col_frequency) * 100
Price_ftable <- cbind(rownames(col_frequency), col_frequency, round(col_frequency_rel, 2), round(cumsum(col_frequency_rel), 2))
Price_ftable <- rbind(Price_ftable, c("Total", sum(col_frequency), sum(col_frequency_rel), ""))
Price_ftable <- data.frame(Price_ftable, row.names = NULL)
names(Price_ftable) <- c("Price", "Frequency", "Frequency %", "Cumulative f. %")

library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
kable_styling(
  kable(
    Price_ftable,
    align = c('l','r','r', 'r')
  ),
  bootstrap_options = c("striped","hover"),
  full_width = 0,
  position = "left"
)
Price Frequency Frequency % Cumulative f. %
[0,2e+05] 6 0.12 0.12
(2e+05,4e+05] 42 0.84 0.96
(4e+05,6e+05] 133 2.66 3.62
(6e+05,8e+05] 389 7.78 11.4
(8e+05,1e+06] 693 13.86 25.26
(1e+06,1.2e+06] 1054 21.08 46.34
(1.2e+06,1.4e+06] 1087 21.74 68.08
(1.4e+06,1.6e+06] 867 17.34 85.42
(1.6e+06,1.8e+06] 461 9.22 94.64
(1.8e+06,2e+06] 194 3.88 98.52
(2e+06,2.2e+06] 59 1.18 99.7
(2.2e+06,2.4e+06] 14 0.28 99.98
(2.4e+06,2.6e+06] 1 0.02 100
Total 5000 100

Numerical characteristics of the variable Price:

library(FSA)
## ## FSA v0.9.5. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.
Price_summary <- Summarize(housing$Price, digits = 2)
kable_styling(kable(t(Price_summary)))
n mean sd min Q1 median Q3 max
5000 1232073 353117.6 15938.66 997577.1 1232669 1471210 2469066

Price is quite symmetric as indicated by its mean (1232073) marginally exceeding the median (1232669). A majority, 42%, of the Price observations are clustered within a 1e6-1.5e6 range. The dataset of this variable ranges from a minimum of 15e3 to a maxiumum of 2.4e6.

2.2 Avg..Area.Income

The target property Y=Avg..Area.Income is a continuous variable.

Let’s visualize the distribution using a histogram.

Avg..Area.Income_hist <- hist(
  housing$Avg..Area.Income, 
  col = "skyblue",
  main = "Avg income frequency",
  xlab="Avg income",
  ylab = "Frequency",
  breaks = 12
)

The relative distribution:

par(mfrow=c(1,2))

Avg..Area.Income_hist$counts <- Avg..Area.Income_hist$counts / sum(Avg..Area.Income_hist$counts) * 100

plot(
  Avg..Area.Income_hist,
  col = "skyblue",
  main = "Avg income % distribution",
  xlab="Avg income",
  ylab = "Frequency %",
  cex.main=0.8
)

Avg..Area.Income_box <- boxplot(
  housing$Avg..Area.Income, 
  col = "skyblue",
  horizontal = 1,
  main = "Avg income box plot",
  xlab = "Avg income",
  cex.main = 0.8,
  range = 1.9
)

Avg..Area.Income_hist has an asymmetric left side. The outliers are:

sort(Avg..Area.Income_box$out)
## [1]  17796.63 104702.72 107701.75

The outlier on the left side has a value of ~17769 and the values of the 2 outliers on right are 104e3 and 107e3. Details of the 3 outliers:

Avg..Area.Income_out <- which(housing$Avg..Area.Income %in% Avg..Area.Income_box$out)

library(DT)
datatable(
  housing[Avg..Area.Income_out, ],
  options=list(scrollX=1,pageLength=3,searching = FALSE,scroller = TRUE,scrollY=200)
)

Frequency table of Avg..Area.Income.

intervals <- cut(housing$Avg..Area.Income, Avg..Area.Income_hist$breaks, include.lowest = TRUE)
col_frequency <- table(intervals)
col_frequency_rel <- prop.table(col_frequency) * 100
Avg..Area.Income_ftable <- cbind(rownames(col_frequency), col_frequency, round(col_frequency_rel, 2), round(cumsum(col_frequency_rel), 2))
Avg..Area.Income_ftable <- rbind(Avg..Area.Income_ftable, c("Total", sum(col_frequency), sum(col_frequency_rel), ""))
Avg..Area.Income_ftable <- data.frame(Avg..Area.Income_ftable, row.names = NULL)
names(Avg..Area.Income_ftable) <- c("Avg..Area.Income", "Frequency", "Frequency %", "Cumulative f. %")

library(knitr)
library(kableExtra)

kable_styling(
  kable(
    Avg..Area.Income_ftable,
    align = c('l','r','r', 'r')
  ),
  bootstrap_options = c("striped","hover"),
  full_width = 0,
  position = "left"
)
Avg..Area.Income Frequency Frequency % Cumulative f. %
[1e+04,2e+04] 1 0.02 0.02
(2e+04,3e+04] 0 0 0.02
(3e+04,4e+04] 18 0.36 0.38
(4e+04,5e+04] 197 3.94 4.32
(5e+04,6e+04] 822 16.44 20.76
(6e+04,7e+04] 1706 34.12 54.88
(7e+04,8e+04] 1566 31.32 86.2
(8e+04,9e+04] 588 11.76 97.96
(9e+04,1e+05] 95 1.9 99.86
(1e+05,1.1e+05] 7 0.14 100
Total 5000 100

Numerical characteristics of the variable Avg..Area.Income:

library(FSA)

Avg..Area.Income_summary <- Summarize(housing$Avg..Area.Income, digits = 2)
kable_styling(kable(t(Avg..Area.Income_summary)))
n mean sd min Q1 median Q3 max
5000 68583.11 10657.99 17796.63 61480.56 68804.29 75783.34 107701.8

Avg..Area.Income is quite symmetric, with a slight left skew, as indicated by its mean (68583) marginally deceeding the median (68804). A majority, 65%, of the Avg..Area.Income observations are clustered within a 60k to 80k range. The dataset of this variable ranges from a minimum of 17e3 to a maxiumum of 107e3.

2.3 Avg..Area.House.Age

The target property Y=Avg..Area.House.Age is a continuous variable.

Let’s visualize the distribution using a histogram.

Avg..Area.House.Age_hist <- hist(
  housing$Avg..Area.House.Age, 
  col = "skyblue",
  main = "Avg house age frequency",
  xlab="Avg house age",
  ylab = "Frequency"
)

The relative distribution:

par(mfrow=c(1,2))

Avg..Area.House.Age_hist$counts <- Avg..Area.House.Age_hist$counts / sum(Avg..Area.House.Age_hist$counts) * 100

plot(
  Avg..Area.House.Age_hist,
  col = "skyblue",
  main = "Avg house age % distribution",
  xlab="Avg house age",
  ylab = "Frequency %",
  cex.main=0.8
)

Avg..Area.House.Age_box <- boxplot(
  housing$Avg..Area.House.Age, 
  col = "skyblue",
  horizontal = 1,
  main = "Avg house age box plot",
  xlab = "Avg house age",
  cex.main = 0.8,
  range = 1.6
)

Avg..Area.House.Age has a slight asymmetric right side. The outliers are:

sort(Avg..Area.House.Age_box$out)
##  [1] 2.644304 2.683043 2.797215 2.797619 2.922736 3.004717 3.055370 3.105751
##  [9] 3.118802 3.144894 8.899713 8.916093 8.973441 8.991399 9.008900 9.125283
## [17] 9.519088

The outliers are in the ranges of ~2.64 to ~3.14 (in total 10) and ~8.92 to ~9.51 (in total 7). Details of the 17 outliers:

Avg..Area.House.Age_out <- which(housing$Avg..Area.House.Age %in% Avg..Area.House.Age_box$out)

library(DT)
datatable(
  housing[Avg..Area.House.Age_out, ],
  options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)

2.4 Avg..Area.Number.of.Rooms

The target property Y=Avg..Area.Number.of.Rooms is a continuous variable.

Let’s visualize the distribution using a histogram.

Avg..Area.Number.of.Rooms_hist <- hist(
  housing$Avg..Area.Number.of.Rooms, 
  col = "skyblue",
  main = "Avg rooms frequency",
  xlab="Avg rooms",
  ylab = "Frequency"
)

The relative distribution:

par(mfrow=c(1,2))

Avg..Area.Number.of.Rooms_hist$counts <- Avg..Area.Number.of.Rooms_hist$counts / sum(Avg..Area.Number.of.Rooms_hist$counts) * 100

plot(
  Avg..Area.Number.of.Rooms_hist,
  col = "skyblue",
  main = "Avg rooms % distribution",
  xlab="Avg rooms",
  ylab = "Frequency %",
  cex.main=0.8
)

Avg..Area.Number.of.Rooms_box <- boxplot(
  housing$Avg..Area.Number.of.Rooms, 
  col = "skyblue",
  horizontal = 1,
  main = "Avg rooms box plot",
  xlab = "Avg rooms",
  cex.main = 0.8,
  range = 1.65
)

Avg..Area.Number.of.Rooms has a slight asymmetric left side. The outliers are:

sort(Avg..Area.Number.of.Rooms_box$out)
##  [1]  3.236194  3.950225  3.950973  3.969632  4.027931  9.926147 10.024375
##  [8] 10.144988 10.219902 10.280022 10.759588

The outliers are in the ranges of ~3.23 to ~4.03 and ~9.93 to ~10.6. Details of the outliers:

Avg..Area.Number.of.Rooms_out <- which(housing$Avg..Area.Number.of.Rooms %in% Avg..Area.Number.of.Rooms_box$out)

library(DT)
datatable(
  housing[Avg..Area.House.Age_out, ],
  options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)

2.5 Avg..Area.Number.of.Bedrooms

The target property Y=Avg..Area.Number.of.Bedrooms is a continuous variable.

Let’s visualize the distribution using a histogram.

Avg..Area.Number.of.Bedrooms_hist <- hist(
  housing$Avg..Area.Number.of.Bedrooms, 
  col = "skyblue",
  main = "Avg bedrooms frequency",
  xlab="Avg bedrooms",
  ylab = "Frequency",
  breaks = c(2, 3, 4, 5, 6, 7)
)

The relative distribution:

par(mfrow=c(1,2))

Avg..Area.Number.of.Bedrooms_hist$counts <- Avg..Area.Number.of.Bedrooms_hist$counts / sum(Avg..Area.Number.of.Bedrooms_hist$counts) * 100

plot(
  Avg..Area.Number.of.Bedrooms_hist,
  col = "skyblue",
  main = "Avg bedrooms % distribution",
  xlab="Avg bedrooms",
  ylab = "Frequency %",
  cex.main=0.8
)

Avg..Area.Number.of.Bedrooms_box <- boxplot(
  housing$Avg..Area.Number.of.Bedrooms, 
  col = "skyblue",
  horizontal = 1,
  main = "Avg bedrooms box plot",
  xlab = "Avg bedrooms",
  cex.main = 0.8,
  range = 1.5
)

Avg..Area.Number.of.Bedrooms has a slight asymmetric right side. No outliers were detected:

sort(Avg..Area.Number.of.Bedrooms_box$out)
## numeric(0)

2.6 Area.Population

The target property Y=Area.Population is a continuous variable.

Let’s visualize the distribution using a histogram.

Area.Population_hist <- hist(
  housing$Area.Population, 
  col = "skyblue",
  main = "Area population frequency",
  xlab="Population",
  ylab = "Frequency"
)

The relative distribution:

par(mfrow=c(1,2))

Area.Population_hist$counts <- Area.Population_hist$counts / sum(Area.Population_hist$counts) * 100

plot(
  Area.Population_hist,
  col = "skyblue",
  main = "Area population % distribution",
  xlab="Population",
  ylab = "Frequency %",
  cex.main=0.8
)

Area.Population_box <- boxplot(
  housing$Area.Population, 
  col = "skyblue",
  horizontal = 1,
  main = "Area population box plot",
  xlab = "Population rooms",
  cex.main = 0.8,
  range = 1.5
)

Area.Population has a slight asymmetric left side. The outliers are:

sort(Area.Population_box$out)
##  [1]   172.6107  3285.4505  3883.4482  4114.4894  5727.4859  6248.7561
##  [7]  6805.7408  6821.9502  7234.9635  7360.2952  7522.3331  9193.8332
## [13] 63184.6131 63620.0120 64149.6802 64180.3708 64543.3224 64566.6874
## [19] 65184.5785 65857.9333 66995.4740 67353.9652 67601.2236 67701.6498
## [25] 67727.2291 68311.6958 69553.9883 69575.4495 69592.0402 69621.7134

The outliers are in the ranges of ~170 to ~9200 and ~63000 to ~7000. Details of the outliers:

Area.Population_out <- which(housing$Area.Population %in% Area.Population_box$out)

library(DT)
datatable(
  housing[Area.Population_out, ],
  options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)

2.7 All outliers

In total 93 outliers were found.

all_out <- c(Price_out, Avg..Area.Income_out, Avg..Area.Number.of.Rooms_out, Avg..Area.House.Age_out, Area.Population_out)

library(DT)
datatable(
  housing[all_out, ],
  options=list(scrollX=1,pageLength=7,searching = FALSE,scroller = TRUE,scrollY=200)
)

3 Analysis

# Total of 90 outliers were removed from the _housing_ dataset.
# housing <- housing[-all_out, ]

The target variable for analysis is Price. Visualization of all the numeric variable distributions with a plot().

plot(Filter(is.numeric, housing), pch=20, col='blue', cex=.5)

Looks like The price grows linearly to all variables except Avg..Area.Number.Of.Bedrooms, but everything is quite scattered. The independant variables data is scattered, but is still linearly growing. The most promising/clean linear incline to Price happens with Avg..Area.Income.

3.1 Avg..Area.Income

Looking at the relationship between Avg..Area.Income and Price using a ggplot.

library(ggplot2)
housing %>%
  ggplot(aes(x = Price, y = Avg..Area.Income)) +
  geom_point() + 
  xlab("Price ($)") + 
  ylab("Avg Area Income ($/year)")

We can see that the Price increases when Avg..Area.Income is increasing. Let’s add a color to Area.Population.

library(ggplot2)
housing %>%
  ggplot(aes(x = Price, y = Avg..Area.Income, color = Area.Population)) +
  geom_point() + 
  xlab("Price ($)") + 
  ylab("Avg Area Income ($/year)")

housing %>%
  ggplot(aes(y = Price, x = Avg..Area.Income, color = Area.Population)) +
  geom_point() + 
  ylab("Price ($)") + 
  xlab("Avg Area Income ($/year)")

We can see that the price increases when Avg..Area.Income is increasing. In lower population areas the price is lower.

Plots of filtered data:

library(dplyr)

# TODO: find all permutations of independent vars to find homoscedastic relations 
# intervals for every independant variable
# x loop through each independant variable and each interval
# y loop through anothe independant variable

plot(Filter(is.numeric, housing) %>%
  filter(between(Avg..Area.Number.of.Rooms, 3, 5)) %>%
  select(c(
    "Price",
    "Avg..Area.Income",
    "Avg..Area.House.Age",
    "Area.Population"
  )), pch=20, col='blue', cex=.5, main="3-5 rooms")

plot(Filter(is.numeric, housing) %>%
  filter(between(Avg..Area.Number.of.Rooms, 5, 6)) %>%
  select(c(
    "Price",
    "Avg..Area.Income",
    "Avg..Area.House.Age",
    "Area.Population"
  )), pch=20, col='blue', cex=.5, main="5-6 rooms")

plot(Filter(is.numeric, housing) %>%
  filter(between(Avg..Area.Number.of.Rooms, 6, 7)) %>%
  select(c(
    "Price",
    "Avg..Area.Income",
    "Avg..Area.House.Age",
    "Area.Population"
  )), pch=20, col='blue', cex=.5, main="6-7 rooms")

plot(Filter(is.numeric, housing) %>%
  filter(between(Avg..Area.Number.of.Rooms, 7, 8)) %>%
  select(c(
    "Price",
    "Avg..Area.Income",
    "Avg..Area.House.Age",
    "Area.Population"
  )), pch=20, col='blue', cex=.5, main="7-8 rooms")

plot(Filter(is.numeric, housing) %>%
  filter(round(Avg..Area.Number.of.Rooms) == 7) %>%
  select(c(
    "Price",
    "Avg..Area.Income",
    "Avg..Area.House.Age",
    "Area.Population"
  )), pch=20, col='blue', cex=.5, main="7 rooms")

plot(Filter(is.numeric, housing) %>%
  filter(between(Avg..Area.Number.of.Rooms, 9, 11)) %>%
  select(c(
    "Price",
    "Avg..Area.Income",
    "Avg..Area.House.Age",
    "Area.Population"
  )), pch=20, col='blue', cex=.5, main="9-11 rooms")

See on histograms, how the data is distributed.

price_hist <- hist(
  housing$Price, 
  col = "skyblue",
  main = "Price frequency",
  xlab="$",
  ylab = "Count"
)

par(mfrow=c(1,2))
price_hist$counts <- price_hist$counts / sum(price_hist$counts) * 100

price_hist$counts
##  [1]  0.12  0.84  2.66  7.78 13.86 21.08 21.74 17.34  9.22  3.88  1.18  0.28
## [13]  0.02
plot(
  price_hist,
  col = "skyblue",
  main = "Price % distribution",
  xlab="$",
  ylab = "Frequency %",
  cex.main=0.8
)

price_box <- boxplot(
  housing$Price, 
  col = "skyblue",
  horizontal = 1,
  main = "Price box plot",
  xlab = "$",
  range = 1.5,
  cex.main = 0.8
)

price_box_out <- which(housing$Price %in% price_box$out)
price_box_out
##  [1]   91  257  264  356  466  623  694  697  716  902  925  991 1209 1249 1272
## [16] 1357 1460 1486 1517 1537 1579 1662 1800 2301 2539 2720 2757 2796 3092 3213
## [31] 3503 3923 4130 4401 4452
library(DT)
datatable(
  housing[price_box_out, ],
  options=list(scrollX=1,pageLength=10,searching = FALSE,scroller = TRUE,scrollY=200)
)

3.2 Avg. Area House Age

avg_age_hist <- hist(
  housing$Avg..Area.House.Age, 
  col = "skyblue",
  main = "House age frequency",
  xlab="Years",
  ylab = "Count"
)

Also the relative distribution

par(mfrow=c(1,2))
avg_age_hist$counts <- avg_age_hist$counts / sum(avg_age_hist$counts) * 100

avg_age_hist$counts
##  [1]  0.10  0.52  1.62  4.84  9.08 15.50 19.48 19.00 14.36  9.00  4.32  1.72
## [13]  0.40  0.04  0.02
plot(
  avg_age_hist,
  col = "skyblue",
  main = "House age % distribution",
  xlab="Years",
  ylab = "Frequency %",
  cex.main=0.8
)

avg_rooms_box <- boxplot(
  housing$Avg..Area.House.Age, 
  col = "skyblue",
  horizontal = 1,
  main = "House age box plot",
  xlab = "Years",
  range = 2,
  cex.main = 0.8
)

boxplot
## function (x, ...) 
## UseMethod("boxplot")
## <bytecode: 0x55c317bb64c8>
## <environment: namespace:graphics>

Outliers of the Avg..Area.House.Age.

sort(avg_rooms_box$out)
## [1] 2.644304 9.519088
avg_rooms_box_out <- which(housing$Avg..Area.House.Age %in% avg_rooms_box$out)
avg_rooms_box_out
## [1] 1075 3990
library(DT)
datatable(
  housing[avg_rooms_box_out, ],
  options=list(scrollX=1,pageLenght=5,searching = FALSE,scroller = TRUE,scrollY=200)
)

4 Avg. Area Income

Lets compare the Price with the discrete variables using box plots. The number of rooms and the age of the house are discrete if rounded.

par(mfrow=c(1,2))
discrete_vars = c(
  "Avg..Area.House.Age",
  "Avg..Area.Number.of.Rooms",
  "Avg..Area.Number.of.Bedrooms",
  "Area.Population",
  "Avg..Area.Income"
)

for (i in 1:length(discrete_vars)) {
  boxplot(
    housing$Price~round(housing[,discrete_vars[i]]),
    col="skyblue",
    xlab ="",
    ylab ="",
    las=2,
    cex.axis=0.5,
    cex.lab=0.7, 
    main=paste("Price vs" ,discrete_vars[i]),
    range=2
  )
}

5 Linear regression

Let’s get the summary for all of our variables with Price being our dependent variable and the other variables being independent.

model <- lm(Price ~ Avg..Area.Income + Avg..Area.House.Age + Avg..Area.Number.of.Rooms + Avg..Area.Number.of.Bedrooms + Area.Population, data = housing)

summary(model)
## 
## Call:
## lm(formula = Price ~ Avg..Area.Income + Avg..Area.House.Age + 
##     Avg..Area.Number.of.Rooms + Avg..Area.Number.of.Bedrooms + 
##     Area.Population, data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -337020  -70236     320   69175  361870 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                  -2.637e+06  1.716e+04 -153.708   <2e-16 ***
## Avg..Area.Income              2.158e+01  1.343e-01  160.656   <2e-16 ***
## Avg..Area.House.Age           1.656e+05  1.443e+03  114.754   <2e-16 ***
## Avg..Area.Number.of.Rooms     1.207e+05  1.605e+03   75.170   <2e-16 ***
## Avg..Area.Number.of.Bedrooms  1.651e+03  1.309e+03    1.262    0.207    
## Area.Population               1.520e+01  1.442e-01  105.393   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101200 on 4994 degrees of freedom
## Multiple R-squared:  0.918,  Adjusted R-squared:  0.9179 
## F-statistic: 1.119e+04 on 5 and 4994 DF,  p-value: < 2.2e-16

We can see that “Average Area Income” coefficient does indeed have the highest effect on price, followed by “Average Area House Age” and “Area Population”.

Let’s show graphs for all variables:

plot(housing$Avg..Area.Income, housing$Price, main="House Price vs Income", xlab="Income", ylab="House Price")
model_HouseIncome <- lm(Price ~ Avg..Area.Income, data = housing)
abline(model_HouseIncome, col="red")

plot(housing$Avg..Area.House.Age, housing$Price, main="House Price vs House Age", xlab="House Age", ylab="House Price")
model_HouseAge <- lm(Price ~ Avg..Area.House.Age, data = housing)
abline(model_HouseAge, col="red")

plot(housing$Avg..Area.Number.of.Rooms, housing$Price, main="House Price vs Rooms", xlab="Rooms", ylab="House Price")
model_HouseRooms <- lm(Price ~ Avg..Area.Number.of.Rooms, data = housing)
abline(model_HouseRooms, col="red")

plot(housing$Avg..Area.Number.of.Bedrooms, housing$Price, main="House Price vs Bedrooms", xlab="Bedrooms", ylab="House Price")
model_HouseBedrooms <- lm(Price ~ Avg..Area.Number.of.Bedrooms, data = housing)
abline(model_HouseBedrooms, col="red")

plot(housing$Area.Population, housing$Price, main="House Price vs Population", xlab="Population", ylab="House Price")
model_Population <- lm(Price ~ Area.Population, data = housing)
abline(model_Population, col="red")

We can see from the graphs that “Average Income” has the most significant increase in price with “Average House Age” being the second most impactful.

There is a slight increase in price with the other 3 variables but it is minimal.

6 Conclusion?

Data shows that the more money people earn the more they’re willing to spend on housing.

There are outliers but on average a person spends roughly:

housing$Percentage <- (housing$Price / housing$Avg..Area.Income) * 100
average_percentage <- mean(housing$Percentage)
average_percentage
## [1] 1788.444

-Percent of their yearly income on housing.

This means that in order to buy the houses they need to save up for 17.88 years of their salary or get a house loan with a duration of 20 years or longer.